In this project, we examine the growth in the use of the term “diversity”. To do this, we drew from the MEDLINE database in Web of Science, using the search term “TS=(diversity)” for 1990-2018 and restricting the results to human research. This search returned 71,528 results, which we extracted using the bibliometrix package in R. Next, we converted the abstracts of these articles to a text corpus and used tidytext, a package designed for computational text analysis in R, to analyze patterns within the abstracts. Below is the replication code for these analyses.
In this first chunk, we load our data and examine the overall growth of articles from our search query. The scientific literature that uses the term diversity grows substantially, from about 500 articles in 1990 to over 5,000 in 2017. The mild drop in publications in 2018 is likely the result of incomplete 2018 coverage in the Web of Science database, and it recurs throughout our analyses.
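# packages used throughout this document
library(tidyverse) # readr, dplyr, tidyr, ggplot2
library(tidytext)  # unnest_tokens(), stop_words
library(widyr)     # pairwise_count()
library(igraph)    # graph_from_data_frame()
library(ggraph)    # network plots
library(plotly)    # ggplotly()
library(stringi)   # stri_trim_both()
text_data <- read_csv("text_data.csv")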
## Parsed with column specification:
## cols(
## id = col_double(),
## author = col_character(),
## title = col_character(),
## publication = col_character(),
## abstract = col_character(),
## year = col_double(),
## department = col_character(),
## subject = col_character(),
## grant_information = col_character(),
## keyword = col_character(),
## pubmed_id = col_character(),
## doi = col_character(),
## country = col_character()
## )
# checking to see how the overall data looks
by_year <- text_data %>%
  count(year, sort = TRUE)
by_year <- ggplot(by_year, aes(x = year, y = n)) +
  geom_line() +
  labs(title = "overall growth in diversity-related articles from 1990-2018") +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())
by_year <- ggplotly(by_year); by_year
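To put a number on the growth described above, one can compare the last complete year with the first. The following is a minimal sketch and not part of the original analysis: the counts in `by_year_toy` are hypothetical round numbers standing in for the real `by_year` table, and 2018 is dropped on the assumption that its coverage is incomplete.

```r
library(dplyr)

# Hypothetical yearly article counts (stand-in for the real by_year table)
by_year_toy <- tibble(
  year = c(1990, 2000, 2017, 2018),
  n    = c(500, 1500, 5000, 4600)
)

# Drop the (assumed incomplete) final year, then compare last vs. first year
growth <- by_year_toy %>%
  filter(year < 2018) %>%
  summarise(
    first_n     = n[which.min(year)],
    last_n      = n[which.max(year)],
    fold_change = last_n / first_n
  )
growth
```

With these invented numbers the fold change is 10; the real value depends on the counts in `text_data.csv`.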
Next, we look at word frequencies by year. This chunk of code counts how often each word occurs in the abstracts of our dataset. Note that we also remove some frequently occurring tokens that are not relevant to our analysis (numerals and publisher copyright boilerplate); removing them does not systematically alter our results. You should be able to click through these tables to gain more insight.
# tokenizing the abstract data into words
abstract_data <- text_data %>%
  unnest_tokens(word, abstract) %>%
  anti_join(stop_words)
## Joining, by = "word"
# most frequent word count in abstracts
abstract_data %>%
  count(word, sort = TRUE)
## # A tibble: 159,784 x 2
## word n
## <chr> <int>
## 1 diversity 77112
## 2 study 34721
## 3 patients 32787
## 4 human 32166
## 5 results 29482
## 6 1 27696
## 7 genetic 27652
## 8 health 25787
## 9 cell 25463
## 10 analysis 25431
## # … with 159,774 more rows
# adding custom set of stopwords
my_stopwords <- tibble(word = c(as.character(1:10),
                                "rights", "reserved", "copyright", "elsevier"))
abstract_data <- abstract_data %>% anti_join(my_stopwords)
## Joining, by = "word"
# looking at word frequencies by year
abstract_words <- abstract_data %>%
  group_by(year) %>%
  count(word, sort = TRUE) %>%
  ungroup(); abstract_words
## # A tibble: 690,947 x 3
## year word n
## <dbl> <chr> <int>
## 1 2017 diversity 6399
## 2 2016 diversity 6154
## 3 2015 diversity 5875
## 4 2018 diversity 5813
## 5 2014 diversity 5129
## 6 2013 diversity 5036
## 7 2012 diversity 4693
## 8 2011 diversity 4055
## 9 2010 diversity 3794
## 10 2009 diversity 3336
## # … with 690,937 more rows
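The tokenizing and stopword steps above can be seen on a small scale in the following sketch. The two abstracts are made up for illustration, and `toy_stopwords` is a hypothetical, shortened version of the custom list used above.

```r
library(dplyr)
library(tidytext)

# Two made-up abstracts (hypothetical data, for illustration only)
toy <- tibble(
  id       = c(1, 2),
  year     = c(1990, 2017),
  abstract = c("Genetic diversity of the isolates.",
               "The diversity of 2 populations. Copyright Elsevier.")
)

# Hypothetical custom stopwords: numerals and publisher terms
toy_stopwords <- tibble(word = c(as.character(1:10), "copyright", "elsevier"))

toy_words <- toy %>%
  unnest_tokens(word, abstract) %>%          # one lowercase token per row
  anti_join(stop_words, by = "word") %>%     # drop common English words
  anti_join(toy_stopwords, by = "word") %>%  # drop numerals and publisher terms
  count(year, word, sort = TRUE)
toy_words
```

Passing `by = "word"` explicitly also suppresses the `Joining, by = "word"` message that appears in the output above.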
Now, we look at how the most relevant words vary over time. Brandon chose to include words like diversity, genetic, and population as well as race- and ethnicity-specific terms. As we see, the rise of diversity does not necessarily mean that the focus on race or ethnicity is growing in step with that term. This could mean that diversity is being used as a catch-all in the scientific literature (i.e., that the term has so many meanings that it can stand for anything and everything) or that diversity is most used in fields like immunology or oncology. We explore the second possibility in the breakdown by subject further below.
diversity_terms <- abstract_words %>%
  filter(word %in% c("diversity", "genetic", "population",
                     "ethnic", "racial", "race",
                     "caucasian", "african", "black"))
diversity_terms
## # A tibble: 261 x 3
## year word n
## <dbl> <chr> <int>
## 1 2017 diversity 6399
## 2 2016 diversity 6154
## 3 2015 diversity 5875
## 4 2018 diversity 5813
## 5 2014 diversity 5129
## 6 2013 diversity 5036
## 7 2012 diversity 4693
## 8 2011 diversity 4055
## 9 2010 diversity 3794
## 10 2009 diversity 3336
## # … with 251 more rows
word_graph <- ggplot() +
  geom_line(aes(y = n, x = year, colour = word),
            data = diversity_terms, stat = "identity") +
  labs(title = "growth in the use of diversity-related terms over time") +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())
interactive_graph <- ggplotly(word_graph); interactive_graph
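Because the corpus itself grows every year, raw counts rise for almost any word. One way to check whether a term is gaining ground relative to the rest of the vocabulary is to divide its count by the total number of tokens in that year. The sketch below is not part of the original analysis; it assumes a table shaped like `abstract_words` (columns `year`, `word`, `n`), and all numbers in `toy_counts` are invented.

```r
library(dplyr)

# Invented counts; "other" stands in for every remaining word in that year
toy_counts <- tibble(
  year = c(1990, 1990, 1990, 2017, 2017, 2017),
  word = c("diversity", "race", "other", "diversity", "race", "other"),
  n    = c(50, 10, 940, 600, 40, 3360)
)

term_share <- toy_counts %>%
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%   # fraction of that year's tokens
  ungroup() %>%
  filter(word %in% c("diversity", "race"))
term_share
```

In this invented example the share of "race" stays flat while the share of "diversity" triples, which is the kind of divergence the raw counts above hint at.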
We were also interested in the terms that most commonly occurred alongside diversity. We can find them by running pairwise counts across the abstracts and then using network analysis to map those relationships. In the graph presented below, the nodes correspond to commonly occurring words in our abstract dataset, and the width and opacity of the line between two nodes reflect how often those words appear in the same abstract. The graph still needs some editing to remove terms that are not theoretically relevant.
# co-occurrence count of abstracts
abstract_pairs <- abstract_data %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)
abstract_pairs
## # A tibble: 62,711,759 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 diversity results 20505
## 2 diversity study 19995
## 3 diversity human 15952
## 4 diversity analysis 15309
## 5 diversity data 13729
## 6 diversity genetic 13281
## 7 diversity based 12362
## 8 diversity studies 12099
## 9 diversity methods 12070
## 10 results study 11859
## # … with 62,711,749 more rows
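What the pairwise count measures is easier to see on a toy example: for each pair of words, it counts the number of abstracts (`id` values) in which both appear. The tokens below are invented for illustration.

```r
library(dplyr)
library(widyr)

# Invented tokens: abstract 1 has two words, abstract 2 has three
toy_tokens <- tibble(
  id   = c(1, 1, 2, 2, 2),
  word = c("diversity", "genetic", "diversity", "genetic", "health")
)

# upper = FALSE keeps each unordered pair once instead of twice
toy_pairs <- toy_tokens %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)
toy_pairs
```

Here "diversity" and "genetic" share two abstracts, while each pair involving "health" shares only one.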
# network visualization of most frequent pairs
set.seed(1234)
abstract_pairs %>%
  filter(n >= 5000) %>% # may need to alter this number for a cutoff point
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
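One possible way to thin the network is to drop pairs that involve generic research vocabulary before applying the cutoff. This is only a sketch: the four rows of `toy_pairs` copy counts from the table above, and `generic_terms` is a hypothetical list that would need to be chosen on theoretical grounds.

```r
library(dplyr)

# A few pairs with counts taken from the co-occurrence table above
toy_pairs <- tibble(
  item1 = c("diversity", "diversity", "diversity", "results"),
  item2 = c("results", "genetic", "human", "study"),
  n     = c(20505, 13281, 15952, 11859)
)

# Hypothetical list of generic research words to exclude from the network
generic_terms <- c("results", "study", "analysis", "data", "based",
                   "studies", "methods")

filtered_pairs <- toy_pairs %>%
  filter(!item1 %in% generic_terms, !item2 %in% generic_terms) %>%
  filter(n >= 5000)   # same cutoff as above; adjust as needed
filtered_pairs
```

The filtered table can then be passed to `graph_from_data_frame()` in place of the unfiltered pairs.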
The next step is to look at how the concept of “diversity” is used across the world. As the graph below demonstrates, the growth in the use of “diversity” is concentrated in predominantly White, Westernized countries like the United States, England, the Netherlands, Germany, and Switzerland.
text_data$country <- tolower(text_data$country)
# tokenizing the abstract data into words
abstract_data <- text_data %>%
  unnest_tokens(word, abstract) %>%
  anti_join(stop_words)
## Joining, by = "word"
# looking at word frequencies by year and country
diversity_by_country <- abstract_data %>%
  group_by(year) %>%
  count(word, country, sort = TRUE) %>%
  ungroup(); diversity_by_country
## # A tibble: 1,528,401 x 4
## year word country n
## <dbl> <chr> <chr> <int>
## 1 2017 diversity united states 2781
## 2 2016 diversity united states 2753
## 3 2015 diversity united states 2635
## 4 2014 diversity united states 2464
## 5 2013 diversity united states 2401
## 6 2012 diversity united states 2324
## 7 2018 diversity united states 2312
## 8 2017 diversity england 2261
## 9 2018 diversity england 2005
## 10 2016 diversity england 1982
## # … with 1,528,391 more rows
diversity_by_country <- diversity_by_country %>%
  filter(word == "diversity")
diversity_by_country
## # A tibble: 1,044 x 4
## year word country n
## <dbl> <chr> <chr> <int>
## 1 2017 diversity united states 2781
## 2 2016 diversity united states 2753
## 3 2015 diversity united states 2635
## 4 2014 diversity united states 2464
## 5 2013 diversity united states 2401
## 6 2012 diversity united states 2324
## 7 2018 diversity united states 2312
## 8 2017 diversity england 2261
## 9 2018 diversity england 2005
## 10 2016 diversity england 1982
## # … with 1,034 more rows
diversity_over_time <- ggplot() +
  geom_line(aes(y = n, x = year, colour = country),
            data = diversity_by_country, stat = "identity") +
  labs(title = "growth in the use of diversity over time (by country)") +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank())
diversity_over_time <- ggplotly(diversity_over_time); diversity_over_time
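With one line per country the plot can become crowded. A possible refinement, sketched here under the assumption that only the largest contributors are of interest, is to keep the top countries by total count before plotting. The United States and England rows copy counts from the table above; the "elsewhere" rows are invented.

```r
library(dplyr)

toy_country <- tibble(
  year    = rep(c(2016, 2017), times = 3),
  country = rep(c("united states", "england", "elsewhere"), each = 2),
  n       = c(2753, 2781, 1982, 2261, 10, 12)
)

# Rank countries by their total count across all years and keep the top 2
top_countries <- toy_country %>%
  group_by(country) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total)) %>%
  head(2)

# Keep only the rows belonging to those countries
plot_data <- toy_country %>%
  semi_join(top_countries, by = "country")
plot_data
```

The number of countries to keep (2 here) is arbitrary and would be larger for the real data.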
Lastly, we wanted to look more closely at the growth of diversity-related terms by scientific subject matter. This snippet of code counts the words occurring in abstracts over time, broken down by MEDLINE’s and Web of Science’s subject categories. Because an article can carry several categories, its abstract is counted once per category, so the word totals below are larger than in the earlier tables. I have opted to include only 13 of the 134 categories that could have been graphed here. Overall, we see the rise of diversity in genetics & heredity, biochemistry & molecular biology, microbiology, immunology, and infectious diseases research. We do not see this same rise in the social sciences, though there admittedly is some overall growth in that domain.
text_data <- text_data %>%
  separate(subject, into = paste("subject", 1:15, sep = "_"), sep = ";") %>%
  gather(value, subject, subject_1:subject_15, na.rm = TRUE) %>%
  select(-value)
text_data <- text_data %>%
  separate(subject, into = c("subject", "void"), sep = "[(]") %>%
  select(-void)
text_data$subject <- stri_trim_both(text_data$subject)
text_data$subject <- tolower(text_data$subject)
# unique(text_data$subject)
subject_data <- text_data %>%
  unnest_tokens(word, abstract) %>%
  anti_join(stop_words)
subject_data %>%
  count(word, sort = TRUE)
## # A tibble: 159,751 x 2
## word n
## <chr> <int>
## 1 diversity 437762
## 2 study 217375
## 3 patients 214820
## 4 1 180957
## 5 human 180386
## 6 results 176187
## 7 health 165982
## 8 genetic 157316
## 9 analysis 151810
## 10 isolates 151303
## # … with 159,741 more rows
growth_by_subject <- subject_data %>%
  group_by(year) %>%
  count(word, subject, sort = TRUE) %>%
  ungroup()
subject_data %>%
  count(subject, sort = TRUE)
## # A tibble: 134 x 2
## subject n
## <chr> <int>
## 1 genetics & heredity 4213638
## 2 biochemistry & molecular biology 4016585
## 3 microbiology 2216691
## 4 immunology 1986325
## 5 infectious diseases 1857374
## 6 pharmacology & pharmacy 1466127
## 7 behavioral sciences 1464061
## 8 psychology 1356911
## 9 pediatrics 1344591
## 10 cell biology 1321172
## # … with 124 more rows
graph_by_subject <- growth_by_subject %>%
  filter(word == "diversity") %>%
  filter(subject %in% c("genetics & heredity", "biochemistry & molecular biology",
                        "microbiology", "infectious diseases", "immunology",
                        "pharmacology & pharmacy", "behavioral sciences",
                        "health care sciences & services", "neurosciences & neurology",
                        "psychology", "sociology",
                        "oncology", "business & economics"))
graph_by_subject <- ggplot() +
  geom_line(aes(y = n, x = year, colour = subject),
            data = graph_by_subject, stat = "identity") +
  labs(title = "growth in the use of diversity over time (by subject)") +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.title = element_blank())
graph_by_subject <- ggplotly(graph_by_subject); graph_by_subject
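The subject-splitting step near the top of this section can be hard to follow, so here is a small sketch of the same idea on invented data. It uses `tidyr::separate_rows()` as a shorter alternative to the `separate()` plus `gather()` combination above; this is a swapped-in technique, not the code used for the results. It also behaves slightly differently: it has no 15-category cap and keeps rows whose subject is missing. The subject strings, including the parenthetical note, are hypothetical.

```r
library(dplyr)
library(tidyr)

# Invented records: article 1 carries two categories, article 2 carries one
toy_subjects <- tibble(
  id      = c(1, 2),
  subject = c("Genetics & Heredity (hypothetical note); Microbiology",
              "Immunology")
)

long_subjects <- toy_subjects %>%
  separate_rows(subject, sep = ";") %>%          # one row per category
  mutate(subject = sub("[(].*$", "", subject),   # drop any parenthetical
         subject = tolower(trimws(subject)))     # trim and lowercase
long_subjects
```

Article 1 now appears in two rows, one per category, which is why each of its words is counted twice in the subject-level tables.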
Overall, this document shows a rise in the use of “diversity” across scientific research. We see a roughly 10-fold increase between 1990 and 2017, most of which occurs in biomedical research from Westernized countries. Our future analyses will examine what implications this has for the use of diversity in and outside of that domain.